Algorithms for extracting lines, paragraphs with their properties in PDF documents
Authors
Abstract
The article discusses algorithms for detecting and extracting lines and paragraphs, together with their properties and attributes, in PDF documents, and analyses the structure of a PDF file and its objects. Due to special operators and objects, document content is saved as symbols or symbol groups. The position of such groups on the page also remains identical. The main challenge we face when extracting from a document is that PDF is a complex format, able to retain various types of information, and that a PDF can be created in several ways.
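As a rough illustration of working with content stored as positioned symbol groups, the Python sketch below groups such fragments into lines and then into paragraphs. It is not the authors' algorithm: the `Fragment` type, the baseline tolerance, and the gap factor are hypothetical choices made for this example, and it assumes the fragments have already been extracted from a page with PDF-style coordinates (origin at the bottom left, so larger `y` is higher on the page).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A symbol group with its position (hypothetical type for this sketch)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline; larger values are higher on the page
    size: float   # font size, used to judge vertical gaps


def group_into_lines(fragments, tolerance=2.0):
    """Group fragments whose baselines lie within `tolerance` of each other."""
    lines = []
    # Visit fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments from left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def group_into_paragraphs(lines, gap_factor=1.5):
    """Start a new paragraph when the gap to the previous line is large."""
    paragraphs = []
    for line in lines:
        if paragraphs:
            prev = paragraphs[-1][-1]
            gap = prev[0].y - line[0].y
            if gap <= gap_factor * prev[0].size:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])
    return paragraphs


def line_text(line):
    """Join the fragments of one line into a string."""
    return " ".join(f.text for f in line)
```

Fixed thresholds are used here only to keep the sketch short; real documents usually need tolerances derived from the fonts and spacing actually found on the page.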
Similar resources
Extracting Precise Data from PDF Documents for Mathematical Formula Recognition
As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterised version. This provides more precise information than is available either directly from the PDF file or by traditional character recognit...
Extracting Precise Data on the Mathematical Content of PDF Documents
As more and more scientific documents become available in PDF format, their automatic analysis becomes increasingly important. We present a procedure that extracts mathematical symbols from PDF documents by examining both the original PDF file and a rasterized version. This provides more precise information than is available either directly from the PDF file or by traditional character recognit...
Extracting Parallel Paragraphs from Common Crawl
Most of the current methods for mining parallel texts from the web assume that web pages of web sites share the same structure across languages. We believe that there still exists a non-negligible amount of parallel data spread across sources not satisfying this assumption. We propose an approach based on a combination of bivec (a bilingual extension of word2vec) and locality-sensitive hashing which...
A Hierarchical Neural Autoencoder for Paragraphs and Documents
Natural language generation of coherent long texts like paragraphs or longer documents is a challenging problem for recurrent network models. In this paper, we explore an important step toward this generation task: training an LSTM (long short-term memory) auto-encoder to preserve and reconstruct multi-sentence paragraphs. We introduce an LSTM model that hierarchically builds an embedding for a...
Extracting Objects and Their Attributes from Tables in Text Documents
Extracting information from tables is an important and rather complex part of information retrieval. For the task of object extraction from HTML tables, we introduce the following methods: determining table orientation, and processing aggregating objects (like Total) and scattered headers (super row labels, subheaders).
Journal
Journal title: E3S Web of Conferences
Year: 2023
ISSN: ['2555-0403', '2267-1242']
DOI: https://doi.org/10.1051/e3sconf/202338908024